Note: Because of the large number of plots, only a small subset of the analysis results is included in this HTML report. All results and visualizations have been exported to the PCA_results/ folder to keep the final file light and readable.

1 I. Data Extraction Pipeline

To analyze voting behavior patterns across different types of elections in Paris since 2000, I designed a data pipeline that is structured, reproducible, and compatible with diverse data formats.

Although the Paris open data portal provides an API and structured endpoints, I deliberately chose cleaned Excel files as the core data source, because the datasets exposed through the API are incomplete and inconsistent across election years and election types. This choice keeps the data aligned with the officially published files and avoids gaps in long-term comparative analyses.

1.1 Pipeline Design Principles

  • Directory-driven architecture: Each election type has its own folder, subdivided by year and round (e.g., Municipales/2020-01).
  • Batch-compatible Excel I/O: All raw data is stored in .xls or .xlsx format, downloaded directly from the Paris open data portal.
  • Type-specific loaders: I implemented five distinct extractors, one for each election type:
    • Présidentielles
    • Législatives
    • Régionales
    • Européennes
    • Municipales
  • Robust filename parsing: Each extractor uses pattern-matching logic (e.g., with str_match or str_extract) to retrieve year and round information from file names.
  • Automated storage: Cleaned matrices are saved as .xlsx files in structured subfolders under *_processed/.
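
The directory convention and the year/round parsing described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the project (which relies on str_match or str_extract); the function names are hypothetical, and the sketch parses the folder name (e.g. 2020-01) because the raw file naming scheme is not shown in this report.

```python
import re
from pathlib import PurePosixPath

# Folder names such as "2020-01" follow the <year>-<round> layout.
FOLDER_PATTERN = re.compile(r"^(?P<year>\d{4})-(?P<round>\d{2})$")


def parse_election_folder(path):
    """Return (election_type, year, round) for a path like 'Municipales/2020-01'."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2:
        raise ValueError(f"expected '<type>/<year>-<round>', got {path!r}")
    election_type, folder = parts[-2], parts[-1]
    match = FOLDER_PATTERN.match(folder)
    if match is None:
        raise ValueError(f"unrecognised folder name: {folder!r}")
    return election_type, int(match["year"]), int(match["round"])


def processed_dir(election_type, year, round_number):
    """Build the output subfolder, e.g. 'Municipales_processed/2008-02'."""
    return f"{election_type}_processed/{year:04d}-{round_number:02d}"
```

Raising an error on an unrecognised folder name, instead of skipping it silently, makes irregular names visible during a batch run.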

1.2 Example Output

  • Régionales_processed/vote_matrix_2015_2eme.xlsx
  • Législatives_processed/2022-01/vote_matrix_Circ_11.xlsx
  • Municipales_processed/2008-02/vote_matrix_Ardt_19.xlsx

To restate the trade-off: a fully API-based pipeline was conceptually desirable, but data gaps and format heterogeneity in the live endpoints made an Excel-based pipeline the more reliable and uniform choice for this project.


2 II. Data Cleaning Strategy

The diversity of sources and formats requires a robust cleaning pipeline to ensure comparability and quality across datasets. The cleaning logic includes:

2.1 Standardizing Variables and Formats

  • Column name normalization using janitor::clean_names() to ensure consistent naming across datasets (e.g., ID_BVOTE → id_bvote)
  • Type enforcement (e.g., ensuring id_bvote is always character)
  • Filling missing values only for numerical variables: all NAs in vote counts are converted to 0 via mutate(across(..., replace_na(...)))
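
The three standardization steps can be illustrated with a small Python sketch. It only approximates what janitor::clean_names() and replace_na() do in R; clean_name and clean_row are hypothetical helpers, and the sketch assumes that every column other than the station identifier holds a vote count.

```python
import re


def clean_name(name):
    """Approximate snake_case normalisation, e.g. 'ID_BVOTE' -> 'id_bvote'."""
    # Replace runs of non-word characters (spaces, hyphens, ...) with "_".
    snake = re.sub(r"\W+", "_", name)
    # Collapse repeated underscores and trim them from both ends.
    snake = re.sub(r"_+", "_", snake).strip("_")
    return snake.lower()


def clean_row(row, id_column="id_bvote"):
    """Normalise keys, force the station id to text, and set missing counts to 0."""
    cleaned = {}
    for key, value in row.items():
        key = clean_name(key)
        if key == id_column:
            cleaned[key] = str(value)   # type enforcement: id is always text
        elif value is None:
            cleaned[key] = 0            # missing vote count -> 0 (assumes a count column)
        else:
            cleaned[key] = value
    return cleaned
```

Accented letters are kept (only case and separators change), which matches column names such as maréchal_marion that appear later in the report.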

2.2 Adding Geographic Hierarchies

  • Extracting arrondissement codes from id_bvote (e.g., "18-3" → 18)
  • Assigning each polling station to a region_group (Nord-Est, Sud-Ouest, etc.) based on arrondissement code
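
A minimal Python sketch of these two steps is given below. The extraction rule follows the "18-3" → 18 example; the arrondissement-to-group table is purely hypothetical, since the report does not list which arrondissements belong to which region_group, and the fallback label "Autre" is likewise an assumption.

```python
def arrondissement_from_id(id_bvote):
    """Extract the arrondissement code from an id such as '18-3' -> 18."""
    prefix, _, _ = str(id_bvote).partition("-")
    return int(prefix)


# HYPOTHETICAL grouping, for illustration only: the actual membership of
# each region_group is not given in the report.
EXAMPLE_GROUPS = {
    "Nord-Est": {18, 19, 20},
    "Sud-Ouest": {15, 16},
}


def region_group(arrondissement, groups=EXAMPLE_GROUPS, default="Autre"):
    """Return the label of the first group containing the arrondissement."""
    for label, members in groups.items():
        if arrondissement in members:
            return label
    return default
```

Passing the grouping as a parameter keeps the lookup logic independent of any particular territorial split.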

2.3 Harmonizing Voting Status Labels

To avoid inconsistencies in vote category names (e.g., "NB_BL" vs "NB_BL_NUL"), I excluded all vote count columns prefixed with nb_ and focused only on expressed vote columns (i.e., actual votes per candidate or list).
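
As an illustration, the column filter amounts to a prefix test on the normalized names. The Python helper below is a hypothetical sketch of that rule, not the R code used in the pipeline.

```python
def drop_count_columns(columns, prefix="nb_"):
    """Drop bookkeeping columns such as 'nb_bl' or 'nb_bl_nul', keep the rest."""
    return [name for name in columns if not name.startswith(prefix)]
```

Because the test runs on names that have already been normalized to lower case, "NB_BL" and "NB_BL_NUL" are both caught by the same prefix.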

2.4 Optional Party Mapping

A future extension of this pipeline would involve mapping candidate names to standardized party labels (e.g., "Jean-Luc MÉLENCHON" → "LFI"), particularly for cross-election comparisons. While not mandatory for matrix decomposition, this becomes relevant in PCA/CCA extensions discussed in Section III.


In summary, the extraction and cleaning process was designed to be robust to filename irregularities, to scale across election types, and to remain Excel-centric so that it matches the completeness and archival format of the officially published data.

3 PCA Analysis

I conducted principal component analysis (PCA) on the first round of the 2017 and 2022 presidential elections in Paris to assess the structure and stability of voter preferences. These two elections featured the same leading candidates—Emmanuel Macron and Marine Le Pen—providing a natural comparison. In both years, the first principal component (PC1) explained over one-third of the variance, and the variable plots consistently revealed a dominant ideological axis opposing Macron (center) and Le Pen (far-right). Despite the emergence of new candidates in 2022, the PCA results showed a remarkably stable structure, suggesting that the underlying political space and polarization patterns in Paris remained largely unchanged across these two electoral cycles.

## [1] "./Présidentielles_processed/vote_matrix_2017_1er.xlsx"
## [2] "./Présidentielles_processed/vote_matrix_2022_1er.xlsx"
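
The "share of variance explained by PC1" quoted throughout this section can be computed as sketched below. This is a NumPy illustration under stated assumptions, not the R code behind the report: it standardizes each column (the report does not say whether the variables were scaled), and the function name is hypothetical.

```python
import numpy as np


def pca_explained_variance(X):
    """Return the share of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardise columns (correlation-matrix PCA). Assumption: no column is
    # constant, otherwise the division below would produce NaN values.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Component variances follow from the singular values: s_k^2 / (n - 1).
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values**2 / (Z.shape[0] - 1)
    return variances / variances.sum()
```

The first element of the returned array is the PC1 share; a value above one third corresponds to the situation described for the two presidential first rounds.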

I performed PCA on the legislative election data at the constituency level to assess whether the vote distributions exhibit clear groupings or extreme outliers across polling stations. In most constituencies, the first two principal components explain a modest proportion of the variance (typically 25–30%), and the resulting individual plots show relatively compact clusters without pronounced extremities. While some constituencies display mild elongation along specific axes, indicating potential latent polarization, I do not observe distinct subgroups or isolated polling stations. This suggests that legislative voting behavior in Paris tends to be moderately structured, but lacks strong clustering or extreme anomalies.

Principal component analysis (PCA) applied to municipal elections reveals a relatively noisy structure. Although the first principal component typically explains around 30–40% of the total variance, it does not correspond to a clear political axis: the individual factor maps show widely scattered points, suggesting that voting patterns vary greatly across polling stations without forming clear ideological clusters. Furthermore, preliminary checks suggest that voter turnout may drive part of the principal components, as high- and low-participation areas tend to align along the same direction in the PCA space. This supports the idea that local context and mobilization play a larger role than stable partisan alignments in municipal elections in Paris.

In the European Parliament elections, principal component analysis (PCA) reveals that the first principal component (PC1) consistently explains only around 18% to 22% of the total variance, indicating a relatively diffuse and multidimensional voting structure. Despite the limited explanatory power of PC1, the variable plots suggest a latent opposition between candidates associated with more extreme or nationalist platforms (e.g., bardella_jordan, maréchal_marion) and those aligned with mainstream center-left parties (e.g., glucksmann_raphael, toussaint_marie), often located on opposite ends of the axis. This suggests that PC1 may still capture a “mainstream vs. radical” ideological divide, although less sharply than in other elections. Overall, the PCA highlights the fragmented and complex nature of the European election space in Paris, where no single dimension fully dominates the political landscape.

The PCA results for the regional elections reveal a consistent territorial structure in voter behavior. The first principal component explains a meaningful share of the variance across years and appears to align with a political-ideological gradient separating different types of candidates. Notably, Valérie Pécresse and Nicolas Dupont-Aignan often occupy one end of the first axis, while candidates like Pierre Laurent, Olivier Besancenot, or Julien Bayou appear on the opposite end. This suggests that the primary dimension may capture a left–right or establishment–anti-establishment divide. Moreover, when coloring individuals by region_group, distinct clustering emerges in some years, indicating the presence of territorial voting patterns, such as a contrast between the Centre/Nord-Est and Sud-Ouest groups. These results suggest that territorial polarization and candidate ideology jointly shape electoral variation in Paris during the regional contests.

3.1 Canonical Correlation Analysis (CCA)

To evaluate the structural similarity between voting behaviors in the 2022 presidential election and the 2024 European election, I conducted a canonical correlation analysis (CCA) at the polling station level.

The analysis yielded a series of 14 canonical correlations, with the first few axes showing very strong correlations:

  • Axis 1: 0.993
  • Axis 2: 0.968
  • Axis 3: 0.911
  • Axis 4: 0.846
  • Axis 5: 0.740

These values indicate that the leading linear combinations of voting results in one election are almost perfectly predictable from those of the other, suggesting a high degree of ideological and behavioral alignment across the two electoral contexts.

Beyond the first few axes, the canonical correlations gradually decline (e.g., Axis 6: 0.536, Axis 10: 0.257), which is expected as deeper dimensions capture more election-specific noise or candidate-specific idiosyncrasies.

This analysis suggests that polling stations exhibit consistent political orientations across national (présidentielles) and European (européennes) elections, at least along the first few dominant ideological dimensions.

## [1] 0.993
## [1] 0.968
## [1] 0.911
## [1] 0.846
## [1] 0.74
## [1] 0.536
## [1] 0.483
## [1] 0.399
## [1] 0.34
## [1] 0.257
## [1] 0.241
## [1] 0.18
## [1] 0.149
## [1] 0.111
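
For reference, canonical correlations such as those listed above can be obtained from two polling-station matrices as sketched here. This is a NumPy illustration, not the R routine used for the report; the function name is hypothetical, and the sketch assumes both matrices share the same rows (stations) and have full column rank after centering.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between two blocks measured on the same rows."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Centre each block so that correlations, not raw cross-products, are compared.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces (reduced QR decomposition).
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the cosines of the principal angles,
    # i.e. the canonical correlations, returned in decreasing order.
    corr = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(corr, 0.0, 1.0)
```

The number of values returned equals the smaller of the two column counts, so a list of 14 correlations is what one would expect if the smaller block has 14 columns.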

3.2 Combining Different Methods

Voting Shift Visualization (by Distance):

This arrow plot illustrates the shifts in voting preferences between the 2017 and 2022 French presidential elections for each polling station, represented in PCA space. Each arrow connects a point’s position in the 2017 PCA configuration to its position in 2022. The color gradient encodes the Euclidean shift distance: darker arrows indicate small shifts, while yellow highlights larger preference changes. Most arrows are short and cluster around the origin, implying that for a majority of polling stations, electoral preferences remained relatively stable. However, a few long arrows in the upper-right quadrant signal dramatic shifts—possibly due to candidate turnover, voter realignment, or local political mobilizations.
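
The shift distance that drives the color gradient is the length of each arrow. The Python sketch below shows the computation under two assumptions that are not stated in the report: coordinates are stored per polling station id, and the 2017 and 2022 coordinates are expressed in a common PCA space so that they are directly comparable. Function and variable names are hypothetical.

```python
import math


def shift_distance(start, end):
    """Euclidean length of the arrow from a station's 2017 to its 2022 position."""
    return math.dist(start, end)


def largest_shifts(coords_2017, coords_2022, top=5):
    """Rank stations present in both years by shift length, longest first."""
    shared = coords_2017.keys() & coords_2022.keys()
    shifts = {
        station: shift_distance(coords_2017[station], coords_2022[station])
        for station in shared
    }
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)[:top]
```

Stations that exist in only one of the two years are ignored, since no arrow can be drawn for them.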


Voting Shift by Region Group:

This plot disaggregates the same shift vectors by region group. Color coding shows that different geographic zones exhibit different voting dynamics. For instance, arrows from the Nord-Est and Nord-Ouest regions appear more dispersed, suggesting greater heterogeneity or volatility in those areas. In contrast, Sud-Ouest and Centre regions show denser, shorter shifts, indicating more stable patterns. This spatial structuring suggests that regional political culture or local socioeconomic factors may mediate how national political changes translate into local voting behavior.


4 Clustering

The hierarchical clustering based on PCA results reveals four distinct clusters of voting behavior across polling stations. Cluster 1 (green) is concentrated on the left side of the PCA space, indicating a group of stations with similar preference patterns that are relatively distinct from the others. Cluster 4 (pink), by contrast, is more spread out on the right side, possibly reflecting more diverse or polarized voting tendencies. Clusters 2 and 3 occupy intermediate positions and partially overlap, suggesting transitional or mixed voting profiles.